This case study is the capstone project of the Google Data Analytics course. The capstone brings everything I've learned together.
I used the dataset provided in the course: Divvy_Trips_2019_Q1. This is a fictional dataset created for educational purposes, so its source and credibility cannot be verified. Fortunately, the data was mostly clean: there were no missing values or duplicates in the columns relevant to my analysis.
Following the case study steps, I created several new columns:
ride_length: Calculates the duration of each ride in HH:MM:SS format by subtracting start_time from end_time.
day_of_week: Assigns a numeric value to each day (1 for Sunday through 7 for Saturday).
day_of_week_str: Converts the numeric day into its name (e.g., "Monday").
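The same three columns could also be derived in R. The following is a minimal base-R sketch, assuming start_time and end_time are stored as date-times; the two sample rows and the data frame name trips are hypothetical and only stand in for the real data.

```r
# Hypothetical two-row sample standing in for the real trip data
trips <- data.frame(
  start_time = as.POSIXct(c("2019-01-06 10:00:00", "2019-01-07 23:50:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("2019-01-06 10:07:30", "2019-01-08 00:05:00"), tz = "UTC")
)

# ride_length: end_time minus start_time in seconds, formatted as HH:MM:SS
secs <- as.numeric(difftime(trips$end_time, trips$start_time, units = "secs"))
trips$ride_length <- sprintf(
  "%02d:%02d:%02d",
  as.integer(secs %/% 3600),
  as.integer((secs %% 3600) %/% 60),
  as.integer(round(secs %% 60))
)

# day_of_week: POSIXlt counts Sunday as 0, so add 1 to get 1 (Sunday) to 7 (Saturday)
trips$day_of_week <- as.POSIXlt(trips$start_time)$wday + 1

# day_of_week_str: look the name up in a fixed vector to avoid locale-dependent output
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
trips$day_of_week_str <- day_names[trips$day_of_week]
```

The fixed day_names vector is a deliberate choice here: weekdays() would return names in the session's locale, which may not be English.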
I used RStudio to visualize the data and generate insights. First, I installed and loaded the necessary packages:
if (!require("tidyverse")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("tidyverse")
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require("ggplot2")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("ggplot2")
}
if (!require("plotly")) {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
  install.packages("plotly")
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(tidyverse)
library(ggplot2)
library(plotly)
library(readxl)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Then, I imported the dataset from Excel:
Divvy_Trips_2019_Q1 <- read_excel("C:/Users/User/OneDrive/Desktop/Cyclists Case Study/XLS/Divvy_Trips_2019_Q1.xlsx")
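Once the data is loaded, the earlier note about missing values and duplicates can be checked directly. The following is a minimal base-R sketch on a small hypothetical data frame; treating trip_id as a unique key is an assumption, and the sample values are invented for illustration.

```r
# Hypothetical stand-in for the imported data; trip_id as a unique key is an assumption
trips <- data.frame(
  trip_id  = c(1, 2, 3, 3),
  usertype = c("Subscriber", "Customer", "Subscriber", "Subscriber"),
  gender   = c("Male", NA, "Female", "Female")
)

# Number of missing values in each column
na_counts <- colSums(is.na(trips))

# Number of fully duplicated rows, and of repeated trip IDs
n_dup_rows <- sum(duplicated(trips))
n_dup_ids  <- sum(duplicated(trips$trip_id))
```

Running the same three lines against Divvy_Trips_2019_Q1 would show which columns actually contain NA values before any filtering is applied.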
Trips by User Type and Gender
# Largest trip count across the usertype/gender combinations
max_user <- Divvy_Trips_2019_Q1 %>%
  count(usertype, gender) %>%
  filter(!is.na(gender)) %>%
  pull(n) %>%
  max()
# Grouped bar chart of trips per user type, split by gender (missing gender excluded)
user_plot <- Divvy_Trips_2019_Q1 %>%
  filter(!is.na(gender)) %>%
  ggplot(aes(x = usertype, fill = gender)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = label_comma()) +
  labs(
    title = "Trips by User Type and Gender",
    x = "User Type",
    y = "Number of Trips",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )
ggplotly(user_plot)
Trips by Day of the Week
# Trip count on the busiest day of the week
max_day <- Divvy_Trips_2019_Q1 %>%
  count(day_of_week_str) %>%
  pull(n) %>%
  max()
# Bar chart of trips per day of the week
day_plot <- ggplot(Divvy_Trips_2019_Q1, aes(x = day_of_week_str)) +
  geom_bar(fill = "dark green") +
  labs(
    title = "Trips by Day of the Week",
    x = "Day of the Week",
    y = "Number of Trips"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )
ggplotly(day_plot)
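If day_of_week_str is stored as text, ggplot2 places the bars in alphabetical order (Friday, Monday, Saturday, ...) rather than calendar order. One possible way to get calendar order is to convert the column to a factor with explicit levels. The following is a minimal sketch on a hypothetical vector of day names; the names days, day_levels, and days_ordered are illustrative only.

```r
# Hypothetical sample of day names as they might appear in day_of_week_str
days <- c("Monday", "Sunday", "Friday", "Monday", "Wednesday")

# Fixed Sunday-to-Saturday order, matching the numeric day_of_week coding
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
days_ordered <- factor(days, levels = day_levels)

# Counts now follow calendar order instead of alphabetical order
day_counts <- table(days_ordered)
```

Applied to the real data, the same factor() call on day_of_week_str before plotting should make geom_bar() follow the level order.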
Trips by Birth Year
# Trip count for the most common birth year
max_year <- Divvy_Trips_2019_Q1 %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  pull(n) %>%
  max()
# The most common birth year itself
top_year <- Divvy_Trips_2019_Q1 %>%
  filter(!is.na(birthyear)) %>%
  count(birthyear) %>%
  slice_max(n, n = 1) %>%
  pull(birthyear)
# Bar chart of trips per birth year, restricted to 1940-2000
year_plot <- ggplot(Divvy_Trips_2019_Q1, aes(x = birthyear)) +
  geom_bar(fill = "dark blue") +
  scale_x_continuous(limits = c(1940, 2000)) +
  labs(
    title = "Trips by Birth Year",
    x = "Birth Year",
    y = "Number of Trips"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.85),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.title.y = element_text(margin = margin(r = 15))
  )
ggplotly(year_plot)
## Warning: Removed 18425 rows containing non-finite outside the scale range
## (`stat_count()`).
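The warning above appears because scale_x_continuous(limits = ...) treats values outside the limits as missing, so those rows are dropped together with rows where birthyear is NA. The following is a minimal sketch of the same rule on a hypothetical vector, which makes it possible to count the excluded rows explicitly; the sample values are invented.

```r
# Hypothetical birth years, including a missing value and two out-of-range values
birthyear <- c(1985, 1990, NA, 1900, 2003, 1975)

# Rows that limits = c(1940, 2000) would drop:
# missing values plus anything outside the limits
dropped <- is.na(birthyear) | birthyear < 1940 | birthyear > 2000
n_dropped <- sum(dropped)

# Filtering explicitly keeps the exclusion visible in the code
birthyear_in_range <- birthyear[!dropped]
```

Filtering the data this way before plotting would avoid the warning and document how many trips were left out of the chart.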
I also calculated summary statistics to better understand the dataset:
mean_year <- Divvy_Trips_2019_Q1 %>%
  summarise(mean_year = mean(birthyear, na.rm = TRUE)) %>%
  pull(mean_year) %>%
  round()
current_year <- as.numeric(format(Sys.Date(), "%Y"))
mean_age <- current_year - mean_year
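Note that subtracting from the current year gives the riders' approximate age today, which grows every year the report is re-run. An alternative, sketched below on hypothetical values, is to measure age relative to 2019, the year the trips were taken; the names age_today and age_at_ride are illustrative only.

```r
# Hypothetical birth years standing in for the birthyear column
birthyear <- c(1980, 1990, NA, 1985)

# Mean birth year, ignoring missing values
mean_year <- round(mean(birthyear, na.rm = TRUE))

# Age relative to today (as in the code above) versus age in 2019, the year of the trips
age_today   <- as.numeric(format(Sys.Date(), "%Y")) - mean_year
age_at_ride <- 2019 - mean_year
```

Which reference year is appropriate depends on whether the insight is about riders now or riders at the time of the trips.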
Here are some key insights: